While you are working at your computer at NCEAS, you receive a frantic call from someone working on an OHI project:
THEY NEED HELP!
“No problem” you say, “I’ll get Melanie or Gage”…but they are no where to be found!
“Hold on, I’ll get Jamie or Casey or Courtney”…but they are also gone.
It is clearly going to be up to you to save the day!!
Please help them answer the following questions. You will work from this document, using the code chunks as a work space to work explore the data. Do NOT bother keeping the code chunks neat and organized, we want to see your work process in all its messy glory. However, I DO recommend including plenty of comments, mostly because I have found this helps me organize my thoughts.
You can ask us any questions along the way (slack or issues)!
Good luck!
Goal Read the latest OHI global data into the R working space (call the data object “scores”).
Hint Here is a link to the data: https://raw.githubusercontent.com/OHI-Science/ohi-global/draft/eez/scores.csv
Note There are scores for 2023, you can ignore these. These are just here as a placeholder until we calculate new ones for 2023 (they are the 2022 scores).
Question Why are there N=222 regions (see code below)…aren’t there supposed to be 220? Can you make this data more human readable by including the full goal name (vs. the 2 letter abbreviation) and the country name.
Hint 1 Here is a link to the official region list (the rgn_id column in this file matches the region_id column in the scores data): https://raw.githubusercontent.com/OHI-Science/ohi-global/draft/eez/spatial/regions_list.csv
Hint 2 Here is a table that includes goal names: https://raw.githubusercontent.com/OHI-Science/ohi-global/draft/metadata_documentation/ohi_model/tables/ten_goals.csv
Hint 3 Which region has all NA values for scores?
Answer
Antarctica (region 213) has all NA values for scores which explains one of the extra regions. Region 0 does not have a name, this represents the entire globe.
Question
Which region id corresponds to the United States?
The region_id for the United states is 163. There appears to be both a region id and an admin region id, which is also 163 for the US. However within the US admin_region id, there are other region ids, presumably areas that are not geographically within the US but are territories of the us. These additional region ids are printed below. Note: Discussed that admin region ids are rarely used.
#filter to the us
united_states <- scores_expanded %>% filter(str_detect(admin_country_name, "United States"))
#print the region id
unique(united_states$admin_rgn_id)
## [1] 163
all_us_regions <- c(unique(united_states$region_id))
#look at all of the US region ids
cat("Additional region_ids associated with the United states admin region include", all_us_regions)
## Additional region_ids associated with the United states admin region include 12 13 116 149 150 151 158 159 163
#looking for us only in main region id
#filter to the us
united_states_main <- scores_expanded %>% filter(rgn_name == "United States")
unique(united_states_main$region_id)
## [1] 163
Question When I look at the number of goals, N=19 show up….I thought there were supposed to be 10 goals!!! Can you explain what all these are? Also, why do some of the goal abbreviations have 2 letters and others 3 letters?
Hint 1 Review the methods doc here: https://raw.githack.com/OHI-Science/ohi-global/published/documents/methods/Supplement.html
Although there are 10 goals, four of the goals are calculated from two separate sub-goals. These sub goals are initially calculated on their own before they are merged to create the final goal, this explains 8 of the extra goals. All of the subgoals have 3 letter codes, while the final goals have two letter codes. The final goal “Index” represents the overall index score for each region, which is calculated by averaging the final goals.
Question There are 6 dimensions reported in this dataset (future, pressures, resilience, score, status, and trend). When I look at the number of records in each of these categories, I notice that there are more “score” values reported than “status” values (see working space below). Given that scores are calculated using status, it seems like these should have the same number of records. Can you figure out what is going on here. Is something going wrong?
It looks like only scores are calculated for index and not any of the other dimensions. This makes sense because the dimensions are already incorporated into the scores (score is made up of the other 5 dimenstions) for each goal before they are averaged into the index score. See Figure 2.1 in the methods.
Question Figure out which goals/subgoals have the most NA values for scores in 2021. Which ones have the least? Can you discern any reason why some goals have lots of missing values and others do not?
Hint Include only dimension = score, and year = 2022, and cut region_id = 0.
CS, NP, MAR, AO and TR all have high levels of NAs. The methods say that every region has to have data for a particular layer, unless it isn’t relevant.
CS: The methods state that carbon storage goals are NAs in regions that don’t have the coastal ecosystems that are evaluated for carbon storage.
NP, MAR, AO, ECO, LE, LIV: all relate to economies/livelihoods extracted from the area. These could be NAs in areas that are totally protected/ not inhabited by people. The methods mention food production goals often do not apply in uninhabited areas. However this is a little confusing because FP (which relates to seafood harvest) only has one NA. Fishing relates to where the fishing happens, so that explains why goals related to this rarely have NAs even if other food production goals are NA.
TR: relates to people experiencing/enjoying. I could see how this could be NA in areas that aren’t inhabited as well.
CP: “The amount of protection provided by marine and coastal habitats serving as natural buffers against incoming waves” The methods state that this can include mangroves, coral reefs, sea grasses, kelp forest sea ice, and salt marshes. I assume this goal has high NAs for the same reason as the CS variable, in that not all places have these types of habitats.
Ones with fewest NAs generally apply to all areas.
Question If we have a goal with a future status of 80 and status of 90…what is the score?
Hint Isolate the future, status, and score values for one region and one goal and see if you can identify a pattern.
The score would be 85. The methods state that the current and future status are averaged to create the goal score.
Project Based on your data exploration and other resources, provide some metadata that describes each variable. Write it so it would be useful to you in the future as you are looking through these data.
Write it in the following table. NOTE: Knit the document to see if the table is formatting correctly (but don’t bother if you do not know how to knit a document or if you are running into errors!).
| Variable | Description | Values |
|---|---|---|
| goal | Abbreviation for goal. Represent goals for ocean ecosystems which represent the full suite of benefits that people want and need from the ocean Includes main goals, sub-goals and overall index score. |
character 2 letter codes-10 goals of OHI 3 letter codes-subgoals used to calculate 4 of the goals Index-overall index score |
| dimension | variable used to calculate goals | current status predicted future status:
|
| region_id | For the global OHI, each region is defined as the offshore EEZ area for all coastal countries and territories. | 2020 total +(Antarctica not calculated) + Global Region 0 |
| score | Scores are calculated at the level of region and goal. Goal score represents an average of current and future status score. | -1 to 1 for trend 0-100 all other dimensions |
| year | Scenario year | 2012 to 2023 |
#looking at the range of scores for all dimensions except trend
score_dimension <- scores %>% filter(dimension != "trend")
max(score_dimension$score, na.rm = TRUE)
## [1] 100
min(score_dimension$score, na.rm = TRUE)
## [1] 0
#looking at the range of scores for trend
score_dimension <- scores %>%
filter(dimension == "trend")
max(score_dimension$score, na.rm = TRUE)
## [1] 1
min(score_dimension$score, na.rm = TRUE)
## [1] -1
Project Create a scatterplot that compares 2012 and 2022 scores for each region for the artisanal opportunities goal. Based on this, do scores for the artisanal opportunities goal appear to have increased or decreased over time?
It appears that the scores have increased by a small amount between 2012 and 2022.
#scatterplot
artisinal <- scores_expanded %>%
filter(goal_name == "Artisanal Fishing Opportunity" & dimension == "score") %>%
filter(year %in% c(2012,2022) & region_id != 0)
mean_scores <- artisinal %>%
group_by(year) %>%
summarize(mean_score = mean(score, na.rm = TRUE))
ggplot(artisinal, aes(x = region_id, y = score, col = score)) +
geom_point() +
geom_hline(data = mean_scores, aes(yintercept = mean_score), linetype = "dashed", color = "red") +
facet_grid(~ year) + labs(title = "Artisinal Fishing Opportunity Scores for All Regions", subtitle = )
# Create interactive scatterplot using plotly
artisinal %>% group_by(year) %>% do(p=plot_ly(., x = ~region_id, y = ~score, type = "scatter", mode = "markers", hoverinfo = "text", text = ~rgn_name, name = toString(unique(.$year)))) %>%
subplot(nrows = 1, shareX = TRUE, shareY = TRUE) %>%
layout(title = "Artisinal Fishing Opportunity Scores for All Regions")
Create a plot that shows the distribution (e.g., histogram or barplot or ??) of Index scores of all countries.
#make a histogram
## Code to create a plot showing distribution of country Index scores
scores_expanded %>% filter(goal_name == "Artisanal Fishing Opportunity" & dimension == "score") %>%
filter(year!= "2023") %>% #remove 2023 since this is a placeholder
ggplot() + geom_histogram(aes(x = score), fill = "skyblue") +
labs(title = "Distribution of OHI Scores", subtitle = " Artisinal Fishing Opportunity 2012-2023")
scores_expanded %>% filter(goal_name == "Artisanal Fishing Opportunity" & dimension == "score") %>%
filter(year!= "2023") %>% #remove 2023 since this is a placeholder
ggplot() +
geom_density(aes(x = score), fill = "skyblue") +
labs(title = "Density of OHI Scores", subtitle = " Artisinal Fishing Opportunity 2012-2023")